Molecular Systems Biology — Latest Matching Preprints

1

Robotic perturbation proteomics and AI agents enable scalable drug mechanism discovery

Jiang, Y.; Movassaghi, C. S.; Munoz-Estrada, J.; Sundararaman, N.; Momenzadeh, A.; Meyer, J. G.

2026-05-07 systems biology 10.64898/2026.05.04.722718 medRxiv

Top 0.1%

30.8%

Large-scale mass spectrometry-based proteomic screening could reveal cellular mechanisms of drug action at systems resolution but remains limited by experimental complexity and the difficulty of extracting insight from high-dimensional datasets. Here, we describe an end-to-end platform that combines semi-automated sample preparation, rapid LC-MS/MS, and AI agent-based data analysis to enable scalable proteomic screening. In a screen of 172 compounds in HepG2 cells, we generated 1,232 proteomes with more than 8,700 quantified proteins in approximately three weeks. Agentic AI reduced data analysis and interpretation time to less than one day while translating proteomic measurements into structured mechanism-oriented summaries and experimentally testable hypotheses. Guided by this framework, we validated: (1) a cholesterol-lowering effect of methylene blue in vitro and (2) an association between loratadine exposure and increased circulating iron in matched electronic health record analyses. This work establishes a scalable platform for generating proteomic drug perturbation data and automatically converting that data into mechanistic insights and candidate translational hypotheses using AI.

2

Curating MitoCore: A Standardized Small-Scale Human Metabolic Model as Platform for Proteomics Integration and Disease Modeling

Lange, E.; Santamaria, A. B. R.; Heyer, R.

2026-07-09 systems biology 10.64898/2026.06.29.734258 medRxiv

Top 0.1%

22.6%

Motivation: Central human metabolism powers cellular processes, yet its dysregulation in disease remains poorly understood. While comprehensive genome-scale metabolic models like Human-GEM are available, their size limits interpretability and computational efficiency. Conversely, the smaller MitoCore model is more manageable but lacks the standardized annotations and curated gene-protein-reaction (GPR) associations necessary for omics integration such as protein-constrained modeling. Improving MitoCore's annotation quality is therefore essential for its use in integrative workflows. Results: We systematically updated MitoCore to enhance compatibility with the protein-constrained modeling framework sMOMENT. By restructuring legacy annotations and integrating data from Human-GEM and MitoMammal, we increased EC-codes from 354 to 593 and UniProt-annotated genes from 0 to 592. MitoCore captures central metabolic processes, confirmed by mapping its reactions to 51 of 106 metabolic KEGG modules. Integration of thrombocyte proteomics and experimental ATP data for original and curated models showed an increase in mapped proteins (228 to 294) and reactions with kcat values (295 to 310), adding 43 protein-constrained reactions. Consequently, prediction errors for exchange fluxes and ATP production decreased by 19% and 88%, respectively, with 100% of ATP predictions falling within the 95% confidence interval (compared to 16% for the original model). Finally, we implemented a continuous integration/continuous deployment pipeline for automated updates from future Human-GEM releases. These improvements provide a computationally efficient, well-annotated model for studying central metabolism across human cell types. Availability and Implementation: All source code for reproducing results from this paper is available at https://doi.org/10.5281/zenodo.20813825.

3

Protein Stability, Turnover Kinetics, and Abundance Constrain the Scaling of Protein Interaction Networks

Goel, M.; Nissley, D. A.; Castellanos-Girouard, X.; Kuntz, C. P.; Wang, Y.; Mukhtar, M. S.; Serohijos, A.; Schlebach, J. P.

2026-05-14 systems biology 10.64898/2026.05.11.724303 medRxiv

Top 0.1%

18.9%

The propensity of proteins to form oligomers is ultimately dictated by their structural configuration(s). Proteins that persist in a discrete conformational state may form a limited number of specific interactions while those that sample a broader structural ensemble may instead associate with a wider array of partners. These intrinsic tendencies potentially constrain the way proteins navigate wider interaction networks. In this work, we aggregated and surveyed a wide variety of biophysical, biochemical, and cellular descriptors of the S. cerevisiae proteome to identify biases in the connectivity of its protein-protein interaction network. Using mass spectrometry-based interactome measurements and various protein stability estimates, we find that a disproportionate number of abundant, yet unstable binding proteins act as network hubs. Moreover, we show that these features alone can be used to discriminate between hubs and non-hub proteins with high accuracy (AUROC = 0.898). Interestingly, we find that half-lives of hub proteins depend on whether or not they reside within static complexes and/or whether they interact with molecular chaperones. Finally, we note that the observed connectivity biases associated with abundant, unstable proteins only pertain to network hubs, but not to the bottlenecks that connect them. Together, our findings reveal how the conformational stability of a protein may constrain its context within protein-protein interaction networks.
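
For readers unfamiliar with the AUROC figure quoted in this abstract, the metric can be sketched as the probability that a randomly chosen hub scores higher than a randomly chosen non-hub. The function name, scores, and labels below are illustrative assumptions and are not taken from the preprint, whose classifier is not described here.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: classifier outputs (higher means "more hub-like").
    labels: 1 for hub, 0 for non-hub (hypothetical encoding).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up scores for two hubs and two non-hubs.
example = auroc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives a scale for reading the reported 0.898.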

4

Shield-4i: A Whole-mount Multiplexed Imaging Platform for Studying Multiscale Information Flow in 3D Multicellular Systems

Hornbachner, R.; Shamipour, S.; Arslan, F. N.; Fan, R.; Hess, M.; Curvaia, F.; Lüthi, J.; Oates, A. C.; Bedzhov, I.; Gilmour, D.; Uhlmann, V.; Pelkmans, L.

2026-07-09 systems biology 10.64898/2026.07.01.735871 medRxiv

Top 0.1%

15.3%

Self-organization in multicellular systems emerges from reciprocal interactions across spatiotemporal scales. Understanding how subcellular organization, tissue remodeling and developmental outcome are coordinated thus requires simultaneous profiling of biological processes spanning orders of magnitude in space and time. Yet, a unified experimental and computational framework for capturing these multiscale properties across in vivo and stem cell-derived systems has been lacking. Here, we introduce Shield-4i, a high-throughput, versatile, and accessible method for automated in toto iterative immunofluorescence imaging of whole-mount structures at subcellular resolution. Through polyepoxide-mediated inter- and intramolecular crosslinking, Shield-4i preserves sample integrity during repeated SDS-based elution cycles. We benchmark this method in gastrulating zebrafish and post-implantation mouse embryos and demonstrate its applicability to stem cell-derived 3D gastruloids, achieving up to 30-plex measurements of proteins and their post-translational modifications across hundreds of samples. To enable scalable analysis, we developed a dedicated 3D workflow supporting OME-Zarr-based and FAIR-compliant data storage, standardized processing, and multiscale feature extraction. Applying this framework to investigate gastruloid self-organization, we quantify how cellular physicochemical state and signaling properties encode cell position along embryonic axes and connect molecular patterning and fate decisions to morphological symmetry breaking at the multicellular scale. Together, Shield-4i provides a high-content in toto spatial proteomics platform for dissecting multiscale information flow and self-organization in multicellular systems.

5

PrEditR: A protein-centric platform for CRISPR-mediated base editor sgRNA design

Myers, S. A.; Vasquez Castro, F.; Sanchez Solis, L. D.

2026-05-16 bioinformatics 10.64898/2026.05.15.725600 medRxiv

Top 0.1%

15.3%

Motivation: Post-translational modifications (PTMs) are critical to protein function, yet the function of most known modification sites remains uncharacterized. CRISPR-mediated phenotypic screens using base editors offer a powerful approach to dissecting PTM function at scale. However, existing sgRNA design tools for base editing applications are DNA-centric and lack the throughput required to integrate seamlessly with mass-spectrometry-based proteomics experimental outputs. Results: We introduce protein editing in R, PrEditR, an open-source, protein-centric tool for high-throughput sgRNA design for custom base editor screens. PrEditR enables users to designate specific amino acid residues in proteins and design protospacer sequences to target the endogenous gene to install missense mutations via base editors. Availability and Implementation: PrEditR is available on GitHub and Docker Hub.

6

Global quantification of mammalian gene expression noise

Welter, A. S.; Mutschler, F.; Simon, M.; Giacomelli, C.; Branscheid, A.-C.; Manukyan, A.; Teixeira Alves, L. G.; Gerwien, M.; Kerridge, R.; Landthaler, M.; Wolf, J.; Selbach, M.

2026-05-14 systems biology 10.64898/2026.05.11.724258 medRxiv

Top 0.1%

15.2%

Even cells of the same type growing in the same environment show cell-to-cell differences in protein abundance, a phenomenon known as gene expression noise. This variability can be decomposed into intrinsic components, reflecting molecular randomness, and extrinsic components, arising from differences in cellular state. While gene expression noise has been studied genome-wide in microbes, its global organization remains largely unknown in mammalian cells. Here, we develop a spike-in-based stable isotope single-cell proteomics approach that enables robust quantification of protein-level gene expression noise across thousands of human proteins. We find that protein noise scales inversely with abundance until reaching a plateau, consistent with an extrinsic noise floor and conserved scaling principles observed in bacteria and yeast. Cell cycle stage and cell size contribute substantially to protein variability but do not fully account for the observed heterogeneity. Gene-specific features such as mRNA and protein half-lives and translation efficiency show only weak associations with protein noise, and variability at the mRNA level is a weak predictor of protein variability. Instead, protein noise is largely extrinsic, with coordinated variation across proteins encoding biologically organized cellular states. Consistently, coordinated proteome programs predict intercellular differences in proteome dynamics, linking protein variability to cellular function. Together, these results provide a proteome-wide view of gene expression noise in mammalian cells, establishing that protein-level variability encodes structured and functionally relevant differences in cellular state.

7

Context-dependent molecular responses to heterogeneous metabolic disease traits

Michalettou, T.-D.; Vinuela, A.

2026-06-08 endocrinology 10.64898/2026.05.31.26354544 medRxiv

Top 0.1%

15.1%

Metabolic diseases such as type 2 diabetes (T2D) arise through complex interactions between physiological, molecular, and environmental processes. Clinical traits including age, sex, adiposity, and glycaemic status are strongly associated with disease risk and progression, yet most molecular studies examine these factors independently and assume relatively static molecular regulation. Consequently, how physiological state dynamically reshapes molecular organisation across omics layers remains poorly understood. Here, we integrated transcriptomic, proteomic, metabolomic, and genetic data from 3,027 individuals in the IMI DIRECT cohort to characterise the joint molecular effects of age, sex, body mass index (BMI), and glycated haemoglobin (HbA1c). We identified widespread associations between these traits and molecular phenotypes. However, interaction analyses revealed a more complex context-dependent regulation, showing that the molecular effect of one trait frequently depends on the state of another, with sex-specific effects of age being more prominent. We also investigated relationships between different types of molecular phenotypes and how these relationships are modulated by metabolic disease relevant traits, demonstrating that cross-omic molecular coordination is itself dynamically remodelled by physiological and metabolic state. Probabilistic causal inference identified a directionally structured network of age-associated molecules, revealing pathways through which age effects propagate across omics layers, showcased in the example of the mTOR signalling pathway. Integration of this directed network with genetic colocalisation analyses also identified a sub-network relevant for T2D. Collectively, our findings demonstrate that metabolic disease relevant traits not only independently influence molecular phenotype abundance but also jointly reshape the directional organisation of cross-omic molecular networks. 
These results support a model in which metabolic disease susceptibility emerges through dynamic rewiring of interconnected molecular systems and provide a framework for context-dependent biomarker discovery, disease stratification, and precision metabolic medicine.

8

Machine Learning Gap-Fills Missing Transporter Kinetics in Biosystems Across Scales

Qiu, S.; Guo, Z.; Tu, W.; Zhuang, Y.; Wu, S.; Wang, G.

2026-07-07 systems biology 10.64898/2026.07.02.735998 medRxiv

Top 0.1%

15.1%

Understanding transporter kinetics is essential for deciphering metabolite exchanges in biosystems, particularly for cells subject to substrate gradients. Nevertheless, the prediction of transporter kinetic parameters, maximum rate per gram protein (Vmax) and Michaelis-Menten constant (Km), has not yet been tackled. Here, we developed the first compound-protein interaction machine learning model of transporter Vmax and Km, MMTKPred, which achieved R² = 0.553, RMSE = 1.155 mmol/hr/g protein and R² = 0.330, RMSE = 0.935 mM for log10-scaled Vmax and Km prediction, respectively. Moreover, we demonstrated MMTKPred's predictive power across biosystem scales, from capturing transporter kinetics modulated by point mutations and substrate changes at the molecular level, to enabling substrate-sensitive metabolic modelling of non-model yeasts at the cellular level, and rationalizing inter-species substrate competition in co-cultures. Collectively, MMTKPred effectively models metabolite transport spanning from molecular to multi-species scales, thereby offering a computational tool for rational microbial cell factory optimization.
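
As background for the two parameters this abstract predicts, the standard Michaelis-Menten rate law relates them to the transport rate. The sketch below is the textbook equation only, not MMTKPred; the function name and the numeric values are illustrative assumptions.

```python
def transport_rate(substrate_mM, vmax, km_mM):
    """Michaelis-Menten rate: v = Vmax * [S] / (Km + [S]).

    substrate_mM: substrate concentration [S] in mM.
    vmax: maximum rate (e.g. mmol/hr/g protein).
    km_mM: Michaelis-Menten constant in mM.
    """
    if substrate_mM < 0:
        raise ValueError("substrate concentration must be non-negative")
    if km_mM <= 0 or vmax < 0:
        raise ValueError("Km must be positive and Vmax non-negative")
    return vmax * substrate_mM / (km_mM + substrate_mM)


# Illustrative values: when [S] equals Km the rate is half of Vmax.
half_max = transport_rate(2.0, vmax=10.0, km_mM=2.0)
```

This form is also why substrate gradients matter: at low [S] the rate is governed by the ratio Vmax/Km, while at high [S] it saturates at Vmax.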

9

The Case for Kinases: A Phosphorylation Driven Model for Circadian Temperature Compensation

Stevenson, E.-L.; Kelliher, C. M.; Kettenbach, A. N.; Loros, J. J.; Dunlap, J. C.

2026-05-09 genetics 10.64898/2026.05.07.723636 medRxiv

Top 0.1%

12.6%

Circadian rhythms, ~24-hour biological cycles, enable organisms to anticipate rhythmic environmental cycles so they can assign proper day and night functions that align with those cycles. Circadian rhythms are defined by their ability to be reset by external cues, their capacity to continue to oscillate in the absence of those cues, and their capacity to maintain the rate of the clock across a range of ambient temperatures, a property known as temperature compensation. In the Neurospora clock, the White Collar Complex (WCC) drives expression of FRQ which nucleates a complex including FRH and CK1a that phosphorylates and thereby represses WCC activity. Work to date has suggested that kinases may be involved in temperature compensation and that in Neurospora the primary target of these is FRQ. Here we investigate the genetic relationship between two clock kinases, Casein Kinase I (ck-1a) and Casein Kinase II (cka), in their regulation of temperature compensation using novel alleles, ck-1aD135G and Δcka. We find that the clock relies on Casein Kinase I more at cold temperature, but this changes as temperature increases, and the clock relies more on Casein Kinase II at warm temperatures. Using quantitative proteomics on FRQ across temperatures, we find that the FRQ phosphorylation landscape is dependent on temperature and is altered in temperature compensation mutants. This leads to the development of a phosphorylation driven model for temperature compensation, where key temperature compensation specific domains on FRQ are phosphorylated to regulate period length in response to temperature, including by Casein Kinase I and Casein Kinase II.

10

Spatiotemporal remodeling of cytoskeletal and junction networks during somatic cell reprogramming

Samson, R.; Kitaygorodsky, J.; Tersigni, M.; Tursun, T.; Hu, Q.; Hardy, W. R.; Trcka, D.; Rost, H.; Wrana, J. L.; Samavarchi-Tehrani, P.; Gingras, A.-C.

2026-05-28 systems biology 10.64898/2026.05.26.725436 medRxiv

Top 0.1%

12.5%

Reprogramming somatic cells into induced pluripotent stem cells involves a dramatic reorganization of the cytoskeleton and junctions during the critical mesenchymal-to-epithelial transition stage. While protein abundance changes have been profiled, the spatiotemporal dynamics of protein-protein associations involving these structural components remain poorly resolved. Here, we present a time-resolved proximity proteomics resource that maps cytoskeletal and junctional remodeling across 27 baits during the early stages of reprogramming. We identified over 1100 high-confidence interactions, including many not previously reported, capturing the dynamic reorganization of cell architecture. By integrating proximity-dependent biotinylation with quantitative proteomics, we distinguished spatial relocalizations from abundance-driven effects. Dynamic redistributions of actin regulators and desmosomal proteins were observed, and a targeted short interfering RNA screen uncovered early acting structural proteins essential for colony formation. Our findings reveal adhesion and cytoskeletal maturation as structural bottlenecks in reprogramming and provide a broadly applicable framework for mapping subcellular remodeling during dynamic cell fate transitions.

11

ChemoTrack: A comprehensive dataset linking single-cell migration trajectories to precisely defined chemotactic signals

Panigrahi, D.; Sakurai, N.; Mijanovic, L.; Versluis, D.; Tweedy, L.; Pearce, P.; Machesky, L.; Insall, R.

2026-06-29 cell biology 10.64898/2026.06.28.734951 medRxiv

Top 0.1%

12.1%

Chemotaxis drives cell migration in processes ranging from wound healing to embryonic development and cancer metastasis, yet its quantitative understanding remains limited because responding cells change and degrade attractant gradients, and existing datasets are too small and imprecise to capture stochastic behaviour. We present ChemoTrack, a publicly accessible resource comprising 2 million measurements from 500,000 migration tracks, in which the chemoattractant gradient and concentration experienced by every cell at the time of observation are precisely determined. The dataset includes microscopy images and trajectories spanning a full range of biologically relevant chemotactic conditions. Analysis shows that cells steer according to absolute differences in active receptor number, not fractional receptor occupancy, and maximal chemotaxis is not predicted by half-maximal receptor occupancy. By combining scale, precision and accessibility, ChemoTrack shifts quantitative description of eukaryotic chemotaxis from experimental conditions to the instantaneous chemical signal experienced by individual cells, enabling future mathematical and mechanistic analyses.

12

STARMAP: A 3D-informed framework for mapping functional regions in proteins to regulatory and cellular phenotypes

Shukla, K.; Castro, J.; Cheng, D.; Holley, L.; Brunk, E. C.

2026-05-08 genomics 10.64898/2026.05.05.723010 medRxiv

Top 0.1%

11.4%

Artificial Intelligence (AI) has transformed biology by revealing patterns in large-scale datasets and predicting regulatory relationships. Yet even the most advanced models often fail to identify biologically meaningful mechanisms from statistical associations. This limitation arises not from algorithmic capacity but from the lack of mechanistically grounded input features. Our structure-informed framework Structure-based Topological Analysis of Regulatory and Molecular Activity Patterns (STARMAP) embeds protein three-dimensional structure and population-scale functional genomics data into a unified representation for mechanistic inference. By mapping over 1.5 million naturally occurring variants across ~1,700 cancer cell lines onto protein structures, STARMAP was able to identify spatial clusters of variation associated with shifts in transcriptional regulatory networks and drug response phenotypes. This approach transforms natural genetic variation into a large-scale, structure-informed screen, enabling systematic discovery of regulatory relationships across the proteome and providing interpretable and testable models of cellular regulation.

13

Stability-driven multi-omics integration for reproducible latent structure

Guan, H.; Gerwen, M. v.; Kim-Schulze, S.; Colicino, E.; Dolios, G.; Petrick, L.

2026-06-27 bioinformatics 10.64898/2026.06.23.734064 medRxiv

Top 0.1%

10.7%

High-dimensional multi-omics data integration offers novel opportunities to characterize complex biological systems. Even though sampling variability frequently compromises findings, particularly in small cohorts, the reproducibility and generalizability of the derived latent structures are insufficiently evaluated. We propose a stability-driven framework for multi-omics integration that combines sparse generalized canonical correlation analysis with repeated cross-validation, out-of-sample projection, and systematic evaluation of both component-level and feature-level stability. We apply this framework to untargeted metabolomic and Olink targeted inflammation proteomic profiles in a thyroid cancer case-control cohort (n = 162). Our stability-driven integration identified reproducible metabolomic and proteomic latent components that showed consistent out-of-sample disease associations and tracked temporally structured changes relative to time to diagnosis. The proposed framework provides a generalizable strategy for identifying reproducible latent structures that improve robustness of biological inference in multi-omics studies.

14

Temporal Biodynamics: An AI Platform for Identification of Stage-Relevant Targets and Biomarkers

Natekar, P.; Yao, B.; Mohammad-Taheri, S.; Rusnak, A.; Gort-Freitas, N. A.; Fillatre, J.; Raymond, J. J.; Saksena, S. D.; Lipnick, S.; Sokolov, A.

2026-06-06 bioinformatics 10.64898/2026.06.03.729984 medRxiv

Top 0.1%

10.6%

Temporal modeling of disease progression is poised to revolutionize the process of target identification, leading to better characterization of and intervention at the critical early stages of chronic conditions. Temporal Biodynamics is an artificial intelligence-driven platform that leverages within-tissue heterogeneity in cross-sectional cohorts to assemble a single, continuous trajectory of transcriptomic changes between health and disease. We demonstrate that the platform enriches for known disease-associated genes and proteins by more than 50% over the conventional case-control comparisons. When compared to other published pseudotime methods, our models were better at extracting disease-relevant signals in the presence of confounders and co-morbidities. The Temporal Biodynamics platform enables rich profiling of a disease continuum, providing temporal insights that are otherwise hidden by the traditional discrete staging of chronic diseases. This includes detecting cascades of molecular events, providing clues regarding causality, and increasing confidence in blood-based protein biomarkers using tissue-based context.

15

Learning proteomic disease trajectories with flow matching

Hartman, E.; Karlsson, C.; Malmström, J.

2026-07-13 bioinformatics 10.64898/2026.07.08.737311 medRxiv

Top 0.1%

10.0%

High-throughput proteomics has enabled detailed characterization of molecular states across health and disease. However, biological systems are inherently dynamic and methods for reconstructing continuous proteome changes remain limited. Here, we introduce proteome velocity, a framework for inferring continuous proteome trajectories from cross-sectional or sparsely sampled proteomics data using flow matching, in which a neural network learns velocity fields over proteome space. Proteome velocity estimates how rapidly and in which direction protein abundances change along a biological progression, such as disease. In mouse sepsis, covariate-conditioned velocity models resolved tissue- and pathogen-specific proteome trajectories and identified inflammatory proteins with distinct temporal activation patterns across infection routes and organ systems. In clinical COVID-19 plasma proteomes, inferred trajectories separated into distinct velocity programs associated with disease severity. These results show how generative trajectory models can transform cross-sectional proteomics data into interpretable, protein-resolved representations of molecular progression.
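
The flow matching idea named in this abstract can be sketched with its simplest variant: a straight-line path between a start state and an end state, whose velocity the network is trained to reproduce. The preprint's actual path, conditioning, and network are not given here, so everything below (function names, the linear path, the toy two-protein vectors) is an assumption for illustration.

```python
def interpolate(x0, x1, t):
    """Point on the straight-line path x_t = (1 - t) * x0 + t * x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight-line path, constant in t: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(velocity_model, x0, x1, t):
    """Mean squared error between predicted and target velocity at x_t.

    velocity_model: any callable (x_t, t) -> list of floats; in practice
    this would be a neural network (hypothetical placeholder here).
    """
    x_t = interpolate(x0, x1, t)
    predicted = velocity_model(x_t, t)
    target = target_velocity(x0, x1)
    return sum((p - u) ** 2 for p, u in zip(predicted, target)) / len(target)


# Toy two-protein abundance vectors for an early and a late state.
early, late = [0.0, 0.0], [2.0, 4.0]
loss_for_zero_model = flow_matching_loss(lambda x, t: [0.0, 0.0], early, late, 0.3)
```

Training would average this loss over many sampled pairs and times; the learned field can then be read per protein, which is one way to interpret "how rapidly and in which direction" abundances change.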

16

Joint Variable Selection for Omic Biomarkers in Time-to-Event Data

Bajzik, J.; Depope, A.; Zolfimoselo, Y.; Sharipov, A.; Lesayova, A.; Klein, H.; Richmond, A.; Vernardis, S.; Grauslys, A.; Andrejev, S.; Zelezniak, A.; Ralser, M.; Marioni, R.; Mondelli, M.; Robinson, M. R.

2026-05-04 bioinformatics 10.64898/2026.04.30.721585 medRxiv

Top 0.1%

9.9%

The incidence of the vast majority of neurodegenerative, cancer, and metabolic diseases generally increases exponentially with age. In large-scale biobanks, linking time-to-diagnosis information in electronic health records to multiple genomic ("multiomics") measures has the potential to reveal the genes and biological pathways involved in the disease onset and progression. To date, association testing has commonly been conducted by testing one variable at a time using semiparametric Cox proportional hazards (CoxPH) models, which ignores correlation structure and increases the risk of false discoveries. To address these issues, we introduce a novel fully parametric Bayesian computational method, vampW, based on the Vector Approximate Message Passing framework applied to a Weibull model. vampW jointly models correlated features, while providing an interpretable hazard structure, producing a continuous survival curve, and incorporating prior knowledge. In an extensive simulation study, we demonstrate that joint modeling of omics data and time-to-event outcomes with vampW substantially reduces false discoveries in comparison to marginal testing and other forms of joint CoxPH models. In 53,018 individuals from the UK Biobank, vampW identifies 219 protein associations with 24 disease outcomes, most of which are not among the top marginal discoveries. We further correct protein levels for exponential age effects, identifying 1,308 associations and highlighting the sensitivity of the analysis to age-correction methodology. Our findings replicate in independent cohorts using different measurement technologies, within data from Iceland and a novel Generation Scotland proteomics dataset. 
vampW also achieves significant improvement in the prediction of disease onset times: across 14 outcomes, it reduces the root mean squared error by over 32% and 26%, when compared to CoxPH variants and the deep learning approach DeepSurv, respectively, while maintaining predictive utility in minority populations. In summary, vampW offers accurate and interpretable variable selection and out-of-sample prediction within a single computational framework, making it a powerful tool for dissecting the genomic architecture of common complex disease onset.
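
To make the "interpretable hazard structure" and "continuous survival curve" of a Weibull model concrete, the textbook shape/scale form can be sketched as follows. This is the standard Weibull distribution only; vampW's own parameterization and inference are not described in this listing, so the function names and parameter values are assumptions.

```python
import math


def weibull_hazard(t, shape, scale):
    """Hazard h(t) = (shape / scale) * (t / scale) ** (shape - 1)."""
    if t <= 0 or shape <= 0 or scale <= 0:
        raise ValueError("t, shape and scale must all be positive")
    return (shape / scale) * (t / scale) ** (shape - 1.0)


def weibull_survival(t, shape, scale):
    """Survival S(t) = exp(-(t / scale) ** shape)."""
    if t < 0 or shape <= 0 or scale <= 0:
        raise ValueError("t must be non-negative; shape and scale positive")
    return math.exp(-((t / scale) ** shape))


# Illustrative values: shape > 1 gives a hazard that rises with age.
hazard_at_60 = weibull_hazard(60.0, shape=3.0, scale=80.0)
hazard_at_70 = weibull_hazard(70.0, shape=3.0, scale=80.0)
```

With shape equal to 1 the hazard is constant, and with shape above 1 it increases over time, which is what makes a fully parametric form easy to interpret for age-dependent onset.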

17

Adenylyl cyclases combinatorially integrate opposing dopamine receptor signals

Gregrowicz, J.; Elowitz, M. B.

2026-07-13 systems biology 10.64898/2026.07.10.737756 medRxiv

Top 0.1%

9.8%

Dopamine receptors are divided into two families which exert opposing effects on the second messenger cyclic AMP (cAMP). While most neuronal cell types express a single receptor subtype, some neurons co-express opposing receptor subtypes. It remains unclear how these cells could resolve simultaneous stimulatory and inhibitory inputs. Here, we introduce a multiplexed assay that quantifies surface receptor abundance and dynamic cAMP output in single cells. Using this assay, together with mathematical modeling, we demonstrate that signals from opposing receptor subtypes are integrated flexibly by downstream adenylyl cyclases (ACs) rather than at the receptor level. Because AC isoforms exhibit unique biochemical properties, a cell's AC expression profile determines whether conflicting inputs are cancelled, suppressed, or amplified. Brain transcriptome analysis indicates that co-expression of opposing dopamine receptors is associated with expression of specific AC isoforms predicted to sustain signaling during multi-receptor activation. Our results show that dopamine signal integration depends on the expression profiles of receptors and AC isoforms in a predictable way.

18

Comprehensive evaluation of LLM capabilities for interpretation and analysis of genome-scale metabolic models in metabolic engineering

Yeoh, J. W.; Patro, C. P. K.; Wong, L.; Poh, C. L.

2026-06-08 systems biology 10.64898/2026.06.03.730004 medRxiv

Top 0.1%

9.7%

Genome-scale metabolic models (GSMs) underpin pathway and strain engineering by linking genes to metabolic reactions and enabling system-level simulation of cellular fluxes and intervention effects, yet end-to-end analysis workflows remain fragmented, expert-demanding, and slow to adapt. Large language models (LLMs) could transform this landscape, lowering the barrier by explaining concepts, interpreting GSM files, and turning natural-language instructions into valid analysis code, thereby substantially mitigating the time, effort, and expertise required. However, their reliability for domain-specific tasks remains unexplored. Here, we delivered a systematic benchmark of four leading LLMs (GPT-4, Gemini, Claude, DeepSeek-R1) across four task areas central to metabolic engineering: domain knowledge, metabolic flux prediction, pathway construction, and flux optimization. For benchmarking, we introduced a standardized, rubric-based evaluation framework that uses multi-LLM automated scoring (an ensemble of LLM-as-a-judge assessments) and two distinct sets of nine task-tailored metrics (domain vs coding-focused tasks), rated on a 1-5 scale (up to 45 per task), covering scientific validity and code executability where applicable. Across tasks, we reveal consistent strengths (conceptual explanation, code synthesis) and critical failure modes (e.g., context window limitations, incorrect identifier assumptions, strain-dependent reasoning errors, and errors in domain-specific algorithms). In aggregate, DeepSeek-R1 led in domain tasks, narrowly edging GPT-4, Claude, and Gemini, demonstrating that conceptual biological logic remains highly invariant across architectures. In contrast, Gemini achieved the highest score for coding tasks, distinguished by functional execution and excelling in error handling, documentation, and readability, followed by GPT-4, Claude, and DeepSeek. 
We also evaluated LLM self-inspection capability by injecting subtle, consequential faults: a stoichiometric sign error causing mass imbalance and an omitted pathway reaction. We reveal that conversational "blind search" prompting completely fails to localize these network faults. Instead, robust error localization requires prompts reframed with domain-informed constraints that force the LLM to leverage tool-assisted code procedures, such as COBRApy mass-balance functions. Together, this work establishes an evidence-based baseline for LLM-enabled GSM analysis, providing actionable guidance for building reliable, automation-ready workflows for pathway and strain design.
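
The kind of fault described above, a stoichiometric sign error that breaks mass balance, can be illustrated with a small element-counting check. This is a plain-Python sketch written for this listing, not COBRApy and not the authors' procedure; the function name, the data layout, and the toy reaction are assumptions.

```python
from collections import defaultdict


def mass_imbalance(stoichiometry, formulas):
    """Return the net element counts of a reaction; empty dict if balanced.

    stoichiometry: metabolite id -> coefficient (negative = consumed,
        positive = produced).
    formulas: metabolite id -> {element symbol: atom count}.
    """
    totals = defaultdict(float)
    for metabolite, coefficient in stoichiometry.items():
        for element, count in formulas[metabolite].items():
            totals[element] += coefficient * count
    # Keep only elements whose net count is not (numerically) zero.
    return {el: net for el, net in totals.items() if abs(net) > 1e-9}


# Toy reaction 2 H2 + O2 -> 2 H2O with hypothetical metabolite ids.
formulas = {"h2": {"H": 2}, "o2": {"O": 2}, "h2o": {"H": 2, "O": 1}}
balanced = mass_imbalance({"h2": -2, "o2": -1, "h2o": 2}, formulas)
# Same reaction with the product coefficient's sign flipped (injected fault).
faulty = mass_imbalance({"h2": -2, "o2": -1, "h2o": -2}, formulas)
```

Running such a check over every reaction turns fault localization into a deterministic scan, which is consistent with the abstract's point that tool-assisted procedures succeed where conversational search does not.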

19

Clinical Trial and Ontology-Derived Positive and Negative Benchmark Datasets for Drug Repurposing Across Rare Diseases

Ravandi, C. B.; Mowrey, W.; Chatterjee, A.; Khanshan, F.; Haddadi, P.; Mobarec, J. C.; Lambden, S.; Eliassi-Rad, T.; Ricchiuto, P.; Risa, G.

2026-07-08 systems biology 10.64898/2026.06.15.732135 medRxiv

Top 0.1%

9.6%

Evaluating the potential applications of a medicine is a fundamental challenge in drug development. Standardized, decision-oriented benchmarks are lacking that test whether computational models can generalize therapeutic hypotheses across diseases in ways that reflect real-world pharmaceutical investment decision making. To address this gap, we introduce two complementary resources: the Indication Expansion Investment Decision Network (IxIDN) and the Orphanet Rare Disease Ontology Negative-network (ORDON). IxIDN is a clinical-trial-derived positive benchmark constructed by projecting drug-disease associations from pharmaceutical clinical trials into a disease-disease network; each edge connects a pair of diseases that have entered clinical trials for the same drug, thereby capturing cases in which concrete indication-expansion decisions have been made. The current release contains 574 rare diseases and 5,336 edges. In contrast, ORDON serves as a stringent, biology-aware negative benchmark derived from the authoritative Orphanet Rare Disease Ontology. It identifies maximally distant disease pairs according to curated hierarchical structure and genetics-linked inheritance patterns, providing 793 rare diseases and 5,000 edges that represent high-separation negative candidates across therapeutic areas. Together, IxIDN and ORDON enable rigorous cross-evidence generalization from clinical trials to disease ontology, testing for Disease-Disease Association Learning (DDAL), a core task for mechanism-centered drug repurposing and indication expansion. All data are publicly available with detailed metadata, enabling reproducible evaluation of models on transparent, decision-relevant benchmarks.
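The projection step behind IxIDN, as described above, can be sketched as a one-mode projection of a bipartite drug-disease graph. This is a minimal illustration under the assumption that the input is a list of (drug, disease) trial pairs; the function name, the data layout, and the placeholder drug and disease labels are hypothetical, and the released dataset may apply further filtering not shown here.

```python
from itertools import combinations


def project_to_disease_network(trials):
    """Project (drug, disease) pairs into undirected disease-disease edges.

    Two diseases are linked when at least one drug has been trialled for
    both. Each edge maps to the set of drugs that support it.
    """
    diseases_by_drug = {}
    for drug, disease in trials:
        diseases_by_drug.setdefault(drug, set()).add(disease)

    edges = {}
    for drug, diseases in diseases_by_drug.items():
        # Sorting gives each undirected edge a single canonical key.
        for a, b in combinations(sorted(diseases), 2):
            edges.setdefault((a, b), set()).add(drug)
    return edges


# Placeholder trial records, not drawn from the dataset.
trials = [
    ("drug_x", "disease_a"),
    ("drug_x", "disease_b"),
    ("drug_x", "disease_c"),
    ("drug_y", "disease_b"),
    ("drug_y", "disease_c"),
    ("drug_z", "disease_d"),
]
edges = project_to_disease_network(trials)
```

Keeping the supporting drugs on each edge is a design choice of this sketch: it preserves the evidence behind an edge, so that edges shared by several drugs can be weighted or audited later.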

20

Supervised restricted data fusion with common, local & distinct components

White, F.; van der Ploeg, G. R.; Heintz-Buschart, A.; Dong, L.; Bouwmeester, H.; Smilde, A.; Westerhuis, J.

2026-05-04 systems biology 10.64898/2026.04.30.721639 medRxiv

Top 0.1%

9.6%

In multi-block data, the dominant sources of variation are not always the most relevant to a response of interest, meaning that purely exploratory decompositions may fail to recover subtle but important response-associated structure. We introduce PESCAR, a supervised extension of Penalised Exponential Simultaneous Component Analysis (PESCA) that incorporates response information directly into the estimation of common, local, and distinct (CLD) structure across multiple data blocks. This allows simultaneous multiblock decomposition and response-informed recovery of latent structure. Through simulation studies, we show that PESCAR can detect weak response-related components across a range of settings, including different noise levels and model-rank mis-specification. Applied to a real multi-omics dataset, PESCAR recovers biologically meaningful response-associated patterns and retains interpretable block structure. We further demonstrate that sparsity in the fitted loading matrices admits a hypergraph-based interpretability layer, summarising overlapping support patterns across components and blocks. These results show that direct incorporation of response information into multiblock decomposition can improve detection of subtle relevant signal and facilitate interpretation in complex systems.
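The common, local, and distinct labels can be read off the block-wise sparsity pattern of a loading matrix, which a short sketch makes concrete. It assumes that a component is common when it has nonzero loadings in every block, local when it is active in more than one but not all blocks, and distinct when it is active in exactly one block. The function name, the tolerance, and the toy loadings are hypothetical, and this is not the PESCAR estimation procedure itself. Each component's set of active blocks can also be viewed as one hyperedge over the blocks, which is one simple reading of the support-pattern summary mentioned above.

```python
def classify_components(loadings_by_block, tol=1e-8):
    """Label each component as common, local, or distinct.

    loadings_by_block maps block name -> list of rows (variables x components).
    A block is active in a component when any loading exceeds tol in magnitude.
    """
    n_blocks = len(loadings_by_block)
    n_components = len(next(iter(loadings_by_block.values()))[0])
    labels = []
    for k in range(n_components):
        active = [
            block
            for block, rows in loadings_by_block.items()
            if any(abs(row[k]) > tol for row in rows)
        ]
        if len(active) == n_blocks:
            kind = "common"
        elif len(active) == 1:
            kind = "distinct"
        elif len(active) > 1:
            kind = "local"
        else:
            kind = "empty"
        labels.append((kind, active))
    return labels


# Toy sparse loadings for three hypothetical blocks and three components.
loadings = {
    "rna": [[0.8, 0.0, 0.5], [0.3, 0.0, 0.0]],
    "protein": [[0.6, 0.4, 0.0], [0.0, 0.7, 0.0]],
    "metabolite": [[0.2, 0.9, 0.0]],
}
labels = classify_components(loadings)
```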